Identity theft is one of the most profitable crimes committed by felons. In cyberspace, it is commonly achieved through phishing. We propose a robust server-side methodology for detecting phishing attacks, called phishGILLNET, which combines natural language processing and machine learning techniques. phishGILLNET is a multi-layered approach. The first layer (phishGILLNET1) employs Probabilistic Latent Semantic Analysis (PLSA) to build a topic model. The topic model handles synonymy (multiple words with similar meanings), polysemy (words with multiple meanings), and other linguistic variations found in phishing. Intentionally misspelled words found in phishing are corrected using Levenshtein edit distance and Google APIs. Taking a term-document frequency matrix as input, PLSA finds phishing and non-phishing topics using tempered expectation maximization. The performance of phishGILLNET1 is evaluated using the PLSA fold-in technique, and classification is achieved using Fisher similarity. The second layer (phishGILLNET2) employs AdaBoost to build a robust classifier, using the probability distributions of the best PLSA topics as features. The third layer (phishGILLNET3) extends phishGILLNET2 by building a classifier from labeled and unlabeled examples using Co-Training. Experiments were conducted on one of the largest public corpora of email data, containing 400,000 emails. Results show that phishGILLNET3 outperforms state-of-the-art phishing detection methods and achieves an F-measure of 100%. Moreover, phishGILLNET3 requires only a small percentage (10%) of the data to be annotated, thus saving significant time and labor and avoiding errors incurred in human annotation.
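
As a minimal sketch of the misspelling-correction step described above, the following Python snippet maps a token to its closest entry in a small vocabulary using Levenshtein edit distance. The vocabulary, the `max_distance` threshold, and the function names `levenshtein` and `correct` are illustrative assumptions rather than part of phishGILLNET, and the Google API correction step is not modeled.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insert, delete, substitute each cost 1)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca -> cb (or match)
            ))
        previous = current
    return previous[-1]


def correct(token: str, vocabulary: list[str], max_distance: int = 2) -> str:
    """Return the nearest vocabulary word, or the token itself if none is close.

    max_distance is a hypothetical threshold chosen for illustration.
    """
    best = min(vocabulary, key=lambda word: levenshtein(token, word))
    return best if levenshtein(token, best) <= max_distance else token


if __name__ == "__main__":
    vocab = ["password", "account", "verify"]  # toy vocabulary (assumption)
    print(correct("pasword", vocab))
    print(correct("acount", vocab))
```

Normalizing tokens this way before building the term-document frequency matrix would let deliberately obfuscated words count toward the same term as their correct spelling, which is the role the abstract assigns to this step.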